Definition 1. (Flow): A flow is a collection of time-indexed vector fields $v = \{v_t\}_{t \in [0,1]}$.
Any flow defines a trajectory taking initial points $x_1$ to final points $x_0$ by transporting the initial point along the velocity fields $\{v_t\}$. Applied to every point of a distribution, this amounts to a transport between two distributions.
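Formally, for a velocity field $v$ and an initial point $x_1$, consider the ODE
$$\frac{dx_t}{dt} = -v_t(x_t)$$
with initial condition $x_1$ at time $t=1$ (the minus sign reflects that time runs backward from $t=1$ to $t=0$). We write
$$x_t := \mathrm{RunFlow}(v, x_1, t)$$
to denote the solution to the flow ODE at time $t$, terminating at the final point $x_0$. That is, RunFlow is the result of transporting the point $x_1$ along the flow $v$ up to time $t$.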
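Flows also define transports between entire distributions by pushing forward points from the source distribution along their trajectories. If $p_1$ is a distribution on initial points, then applying the flow $v$ yields the distribution on final points:
$$p_0 = \{\mathrm{RunFlow}(v, x_1, t=0)\}_{x_1 \sim p_1}$$
This process is denoted $p_1 \overset{v}{\hookrightarrow} p_0$, meaning the flow $v$ transports the initial distribution $p_1$ to the final distribution $p_0$.
The ultimate goal of flow matching is to learn a velocity field $v^*$ that transports $p_1 \overset{v^*}{\hookrightarrow} p_0$, where $p_0$ is the target distribution and $p_1$ is some easy-to-sample base distribution (such as a Gaussian). Note that the sections below use the opposite time convention: there $p_0$ is the simple noise distribution and $p_1$ approximates the data distribution.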
Continuous Normalizing Flows
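Let $\mathbb{R}^d$ denote the data space with data points $x = (x^1, \dots, x^d) \in \mathbb{R}^d$.
The probability density path $p : [0,1] \times \mathbb{R}^d \to \mathbb{R}_{>0}$ is a time-dependent probability density function, i.e., $\int p_t(x)\,dx = 1$.
$v : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ is a time-dependent vector field.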
A vector field $v_t$ can be used to construct a time-dependent diffeomorphic map, called a flow, $\phi : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$, defined via the ordinary differential equation (ODE):
$$\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)) \tag{2}$$
$$\phi_0(x) = x \tag{3}$$
Here $\phi_t(x)$ is the solution to the ODE, and we call it a flow. We can model the vector field $v_t$ with a neural network $v_t(x;\theta)$, where $\theta \in \mathbb{R}^p$ are its learnable parameters.
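To make the flow ODE concrete, here is a minimal sketch that integrates $\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x))$ with $\phi_0(x) = x$ using forward Euler. The helper `run_flow` and the vector field `v_example` are hypothetical names chosen for illustration; in practice $v_t$ would be the neural network $v_t(x;\theta)$ and a higher-order ODE solver would usually be preferred.

```python
import numpy as np

def run_flow(v, x0, n_steps=1000):
    """Forward-Euler integration of d/dt phi_t(x) = v(t, phi_t(x)) from t=0 to t=1."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * v(t, x)  # one Euler step along the vector field
    return x

# Illustrative (assumed) vector field v_t(x) = -x.
# Its exact flow is phi_t(x) = x * exp(-t), which gives a reference value.
def v_example(t, x):
    return -x

x_start = np.array([1.0, -2.0])
x_end = run_flow(v_example, x_start)   # approximates phi_1(x_start)
```

Here `x_end` should be close to `x_start * np.exp(-1.0)`, with an error that shrinks as `n_steps` grows.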
2. The Continuity Equation and Fokker-Planck Equation
Continuity Equation
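How can we test whether a vector field $v_t$ generates a probability path $p_t$? The continuity equation gives the criterion: $v_t$ generates $p_t$ if and only if
$$\frac{d}{dt}p_t(x) + \mathrm{div}\big(p_t(x)\,v_t(x)\big) = 0 \tag{4}$$
where the divergence is taken with respect to the spatial variable $x$, i.e., $\mathrm{div} = \sum_{i=1}^{d} \frac{\partial}{\partial x^i}$.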
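As a sanity check of the continuity equation $\frac{d}{dt}p_t + \mathrm{div}(p_t v_t) = 0$, the sketch below evaluates its left-hand side with central finite differences for a one-dimensional example. The path $p_t = \mathcal{N}(0, s_t^2)$ with $s_t = 1 + t$ and the field $v_t(x) = \frac{s_t'}{s_t}x$ are assumptions made only for this illustration.

```python
import numpy as np

def sigma(t):
    return 1.0 + t  # assumed std schedule, sigma'(t) = 1

def p(t, x):
    s = sigma(t)
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def v(t, x):
    return x / sigma(t)  # sigma'(t) / sigma(t) * x

def continuity_residual(t, x, h=1e-4):
    """Finite-difference estimate of d/dt p_t(x) + d/dx (p_t(x) v_t(x))."""
    dp_dt = (p(t + h, x) - p(t - h, x)) / (2 * h)
    dflux_dx = (p(t, x + h) * v(t, x + h) - p(t, x - h) * v(t, x - h)) / (2 * h)
    return dp_dt + dflux_dx

xs = np.linspace(-3.0, 3.0, 61)
max_residual = np.max(np.abs(continuity_residual(0.5, xs)))
```

The residual vanishes up to discretization error, which is what it means for this $v_t$ to generate this $p_t$.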
Conditional VFs for Fokker-Planck probability paths
Consider a Stochastic Differential Equation (SDE) of the standard form:
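$$dy = f_t\,dt + g_t\,dw$$
with time parameter $t$, drift $f_t$, diffusion coefficient $g_t$, and Wiener process $w$.
The solution to the SDE, $y_t$, is a stochastic process (a continuous time-dependent random variable). Its probability density $p_t(y_t)$ is characterized by the Fokker-Planck equation:
$$\frac{dp_t}{dt} = -\mathrm{div}(f_t p_t) + \frac{g_t^2}{2}\Delta p_t$$
where $\Delta$ is the Laplace operator (in $y$), namely $\mathrm{div}\,\nabla$, with $\nabla$ the gradient operator. We can rewrite the above equation in the form of the continuity equation:
$$\frac{dp_t}{dt} = -\mathrm{div}\Big(f_t p_t - \frac{g_t^2}{2}\nabla p_t\Big) = -\mathrm{div}\Big(\Big(f_t - \frac{g_t^2}{2}\frac{\nabla p_t}{p_t}\Big)p_t\Big) = -\mathrm{div}(w_t\,p_t)$$
where the vector field
$$w_t = f_t - \frac{g_t^2}{2}\nabla \log p_t$$
satisfies the continuity equation with the probability path $p_t$, and therefore generates $p_t$.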
3. Continuous Normalizing Flow (CNF)
A CNF is used to reshape a simple prior density $p_0$ (e.g., pure noise) into a more complicated one, $p_1$, via the push-forward equation:
$$p_t = [\phi_t]_* p_0 \tag{8}$$
The push-forward (or change of variables) operator $*$ is defined by:
$$[\phi_t]_* p_0(x) = p_0\big(\phi_t^{-1}(x)\big)\,\det\left[\frac{\partial \phi_t^{-1}}{\partial x}(x)\right] \tag{9}$$
📖
Change of variables in the probability density function
Suppose $x$ is an $n$-dimensional random variable with joint density $f$. If $y = G(x)$, where $G$ is a bijective, differentiable function, then $y$ has density $p_Y$:
$$p_Y(y) = f\big(G^{-1}(y)\big)\,\left|\det\left[\frac{dG^{-1}(y)}{dy}\right]\right|$$
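A small worked check of the change-of-variables formula, assuming a one-dimensional affine map $G(x) = ax + b$ applied to a standard normal $x$ (the values of `a` and `b` are arbitrary illustrative choices). The pushed-forward density should coincide with $\mathcal{N}(y \mid b, a^2)$.

```python
import numpy as np

def std_normal_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def normal_pdf(y, mean, std):
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

a, b = 2.0, 3.0  # assumed parameters of G(x) = a * x + b

def pushforward_pdf(y):
    x = (y - b) / a      # G^{-1}(y)
    jac = abs(1.0 / a)   # |d G^{-1} / dy|; abs keeps the density positive if a < 0
    return std_normal_pdf(x) * jac

ys = np.linspace(-5.0, 11.0, 9)
max_gap = np.max(np.abs(pushforward_pdf(ys) - normal_pdf(ys, b, abs(a))))
```

The two densities agree up to floating-point rounding, which is the one-dimensional instance of the formula above.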
Relationships between Flow, Vector Field, and Probability Density
Equations (2) and (3) describe the relationship between the flow $\phi_t(x)$ and the vector field $v_t(x)$.
Equations (8) and (9) describe how the flow $\phi_t(x)$ changes the density from $p_0$ to $p_t$.
Equation (4) (the continuity equation) gives a necessary and sufficient condition to test whether a vector field $v_t$ generates a probability path $p_t$.
4. Flow Matching Objective
Notations:
$x_1$: a random variable distributed according to an unknown distribution $q(x_1)$. We have access to samples from $q(x_1)$, but not to the density function itself.
$p_t$: a probability path such that $p_0 = p$ is a simple distribution, e.g., the standard normal $p(x) = \mathcal{N}(x \mid 0, I)$.
$p_1$: approximately equal in distribution to $q$.
The Flow Matching objective is designed to match this target probability path, which will allow us to flow from $p_0$ to $p_1$.
Given a target probability density path $p_t(x)$ and a corresponding vector field $u_t(x)$ that generates $p_t(x)$, the Flow Matching (FM) objective is defined as:
$$\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t,\,p_t(x)}\,\|v_t(x) - u_t(x)\|^2$$
where $\theta$ denotes the learnable parameters of the CNF vector field $v_t$ (here a neural network), $t \sim \mathcal{U}[0,1]$, and $x \sim p_t(x)$.
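Problem: the above objective assumes that the density $p_t(x)$ and the ground-truth vector field $u_t(x)$ are known, but in practice neither is available.
Solution: build $p_t(x)$ and $u_t(x)$ from conditional probability paths and conditional vector fields, which are tractable, and verify that this construction is valid.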
5. From $p_t(x \mid x_1)$ and $u_t(x \mid x_1)$ to $p_t(x)$ and $u_t(x)$
Construct a target probability path as a mixture of simpler probability paths:
Given a sample $x_1$, let $p_t(x \mid x_1)$ denote a conditional probability path such that $p_0(x \mid x_1) = p(x)$ at time $t=0$, and such that $p_1(x \mid x_1)$ at $t=1$ is a distribution concentrated around $x = x_1$, e.g., $p_1(x \mid x_1) = \mathcal{N}(x \mid x_1, \sigma^2 I)$ with a sufficiently small $\sigma$.
The marginal probability path is given by:
$$p_t(x) = \int p_t(x \mid x_1)\,q(x_1)\,dx_1 \tag{11}$$
In particular, at time $t=1$ the marginal probability $p_1$ is a mixture distribution that closely approximates the data distribution $q$:
$$p_1(x) = \int p_1(x \mid x_1)\,q(x_1)\,dx_1 \approx q(x)$$
☝
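Why $p_1(x)$ approximates $q(x)$.
The Dirac delta function ($\delta$ distribution, unit impulse) is defined by:
$$\delta(x) = 0 \ \text{ for } x \neq 0, \qquad \int \delta(x)\,dx = 1$$
The conditional distribution $p_1(x \mid x_1) = \mathcal{N}(x \mid x_1, \sigma^2 I)$ with sufficiently small $\sigma$ behaves like the Dirac delta $\delta(x - x_1)$, and therefore approximately has the sifting property:
$$p_1(x) = \int p_1(x \mid x_1)\,q(x_1)\,dx_1 \approx \int \delta(x - x_1)\,q(x_1)\,dx_1 = q(x)$$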
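The approximation $p_1(x) \approx q(x)$ can be checked numerically in one dimension. The sketch below assumes, purely for illustration, that $q$ is a standard normal density (in practice $q$ is unknown) and approximates the integral over $x_1$ with a Riemann sum; `marginal_p1` is a hypothetical helper name.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def q(x1):
    return normal_pdf(x1, 0.0, 1.0)  # assumed stand-in for the data density

def marginal_p1(x, sigma, grid):
    """Riemann-sum approximation of the integral of N(x | x1, sigma^2) q(x1) over x1."""
    dx1 = grid[1] - grid[0]
    return np.sum(normal_pdf(x, grid, sigma) * q(grid)) * dx1

grid = np.linspace(-8.0, 8.0, 160001)
x = 0.7
gap_small = abs(marginal_p1(x, 0.01, grid) - q(x))  # narrow conditional
gap_large = abs(marginal_p1(x, 0.5, grid) - q(x))   # wide conditional
```

With the narrow conditional the marginal is very close to $q(x)$, while the wide one visibly smooths $q$, which is why $\sigma$ has to be small.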
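We can also define a marginal vector field by averaging the conditional vector fields, weighted by the posterior over $x_1$ (assuming $p_t(x) > 0$):
$$u_t(x) = \int u_t(x \mid x_1)\,\frac{p_t(x \mid x_1)\,q(x_1)}{p_t(x)}\,dx_1 \tag{13}$$
where $u_t(\cdot \mid x_1) : \mathbb{R}^d \to \mathbb{R}^d$ is a conditional vector field that generates $p_t(\cdot \mid x_1)$.
Next, we want to prove that:
The vector field $u_t(x)$ in Equation (13) generates the probability path $p_t(x)$ in Equation (11).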
6. The form of ut(x)
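Theorem 1. Given vector fields $u_t(x \mid x_1)$ that generate conditional probability paths $p_t(x \mid x_1)$, for any distribution $q(x_1)$, the marginal vector field $u_t$ in Equation (13) generates the marginal probability path $p_t$ in Equation (11), i.e., $u_t$ and $p_t$ satisfy the continuity equation (Equation (4)).
Proof:
To prove the theorem, we need to check that $p_t$ and $u_t$ satisfy the continuity equation.
Since $u_t(x \mid x_1)$ generates $p_t(x \mid x_1)$, by the continuity equation we have:
$$\frac{d}{dt}p_t(x \mid x_1) = -\mathrm{div}\big(u_t(x \mid x_1)\,p_t(x \mid x_1)\big)$$
Differentiating the marginal path in Equation (11) and exchanging differentiation and integration (under suitable regularity conditions) gives:
$$\begin{aligned}
\frac{d}{dt}p_t(x) &= \int \Big(\frac{d}{dt}p_t(x \mid x_1)\Big)\,q(x_1)\,dx_1 \\
&= -\int \mathrm{div}\big(u_t(x \mid x_1)\,p_t(x \mid x_1)\big)\,q(x_1)\,dx_1 \\
&= -\mathrm{div}\Big(\int u_t(x \mid x_1)\,p_t(x \mid x_1)\,q(x_1)\,dx_1\Big) \\
&= -\mathrm{div}\big(u_t(x)\,p_t(x)\big)
\end{aligned}$$
where the last equality uses the definition of $u_t(x)$ in Equation (13). This shows that $u_t(x)$ and $p_t(x)$ satisfy the continuity equation.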
End of proof.
7. Conditional Flow Matching (Objectives are consistent)
Problems remain: the probability path $p_t(x)$ in Equation (11) and the vector field $u_t(x)$ in Equation (13) are still intractable because of the integrals over $x_1$.
To solve this problem, we introduce a new objective $\mathcal{L}_{CFM}(\theta)$, defined as:
$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t,\,q(x_1),\,p_t(x \mid x_1)}\,\|v_t(x) - u_t(x \mid x_1)\|^2$$
which avoids $u_t(x)$. The following theorem states that the new objective $\mathcal{L}_{CFM}(\theta)$ equals $\mathcal{L}_{FM}(\theta)$ up to a constant that does not depend on the model parameters $\theta$.
The new objective is tractable as long as we can sample from $p_t(x \mid x_1)$ and compute $u_t(x \mid x_1)$. Consequently, it allows us to train a CNF to generate the marginal probability path $p_t$, which approximates the unknown data distribution $q$ at $t=1$. We do not need access to the marginal probability path or the marginal vector field; we only need to design suitable conditional probability paths and vector fields.
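The following is a minimal sketch of a Monte Carlo estimate of the conditional objective, not a full training loop. It assumes a Gaussian conditional path with linear schedules $\mu_t(x_1) = t x_1$ and $\sigma_t = 1 - (1 - \sigma_{\min})t$ (one illustrative choice that satisfies the boundary conditions of Section 8) and the Gaussian conditional vector field of Theorem 3. The names `cfm_loss`, `zero_model`, and `SIGMA_MIN` are hypothetical.

```python
import numpy as np

SIGMA_MIN = 1e-2  # assumed small final std

def mu_t(t, x1):
    return t * x1

def sigma_t(t):
    return 1.0 - (1.0 - SIGMA_MIN) * t

def sample_conditional(t, x1, rng):
    """Draw x ~ p_t(x | x1) = N(mu_t(x1), sigma_t^2 I)."""
    eps = rng.standard_normal(x1.shape)
    return sigma_t(t) * eps + mu_t(t, x1)

def conditional_vector_field(t, x, x1):
    """u_t(x | x1) = sigma_t' / sigma_t * (x - mu_t) + mu_t' for the schedules above."""
    d_sigma = -(1.0 - SIGMA_MIN)  # time derivative of sigma_t
    d_mu = x1                     # time derivative of mu_t
    return d_sigma / sigma_t(t) * (x - mu_t(t, x1)) + d_mu

def cfm_loss(model, x1_batch, rng):
    """One-batch estimate of E ||v_t(x) - u_t(x | x1)||^2."""
    n = x1_batch.shape[0]
    t = rng.uniform(0.0, 1.0, size=(n, 1))        # t ~ U[0, 1]
    x = sample_conditional(t, x1_batch, rng)      # x ~ p_t(x | x1)
    target = conditional_vector_field(t, x, x1_batch)
    return np.mean(np.sum((model(t, x) - target) ** 2, axis=1))

def zero_model(t, x):
    return np.zeros_like(x)  # placeholder for a neural network v_t(x; theta)

rng = np.random.default_rng(0)
x1_batch = rng.standard_normal((64, 2))  # stand-in for samples from q
loss_zero_model = cfm_loss(zero_model, x1_batch, rng)
```

Only samples of $x_1$, a sampler for $p_t(x \mid x_1)$, and the closed-form $u_t(x \mid x_1)$ are needed; neither $p_t(x)$ nor $u_t(x)$ appears anywhere.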
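Theorem 2. Assuming that $p_t(x) > 0$ for all $x \in \mathbb{R}^d$ and $t \in [0,1]$, $\mathcal{L}_{CFM}$ and $\mathcal{L}_{FM}$ are equal up to a constant independent of $\theta$. Hence, $\nabla_\theta \mathcal{L}_{CFM} = \nabla_\theta \mathcal{L}_{FM}$.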
Proof:
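We make some assumptions to ensure that the integrals exist and that the order of integration can be exchanged (by Fubini's theorem):
$q(x)$ and $p_t(x \mid x_1)$ decrease to zero sufficiently fast as $\|x\| \to \infty$.
Expanding the squared norms in the two objectives gives:
$$\|v_t(x) - u_t(x)\|^2 = \|v_t(x)\|^2 - 2\langle v_t(x), u_t(x)\rangle + \|u_t(x)\|^2$$
$$\|v_t(x) - u_t(x \mid x_1)\|^2 = \|v_t(x)\|^2 - 2\langle v_t(x), u_t(x \mid x_1)\rangle + \|u_t(x \mid x_1)\|^2$$
Note that $\|u_t(x)\|^2$ and $\|u_t(x \mid x_1)\|^2$ are both constant with respect to the parameters $\theta$. We therefore only need to prove that the expectations of the first two terms are equal.
For the first term, using Equation (11):
$$\mathbb{E}_{p_t(x)}\|v_t(x)\|^2 = \int \|v_t(x)\|^2\,p_t(x)\,dx = \iint \|v_t(x)\|^2\,p_t(x \mid x_1)\,q(x_1)\,dx_1\,dx = \mathbb{E}_{q(x_1),\,p_t(x \mid x_1)}\|v_t(x)\|^2$$
For the second term, using Equation (13):
$$\begin{aligned}
\mathbb{E}_{p_t(x)}\langle v_t(x), u_t(x)\rangle &= \int \Big\langle v_t(x), \frac{\int u_t(x \mid x_1)\,p_t(x \mid x_1)\,q(x_1)\,dx_1}{p_t(x)}\Big\rangle\,p_t(x)\,dx \\
&= \iint \langle v_t(x), u_t(x \mid x_1)\rangle\,p_t(x \mid x_1)\,q(x_1)\,dx_1\,dx \\
&= \mathbb{E}_{q(x_1),\,p_t(x \mid x_1)}\langle v_t(x), u_t(x \mid x_1)\rangle
\end{aligned}$$
Therefore, the second terms of the two objectives are also equal. These two results lead to:
$$\mathcal{L}_{FM}(\theta) = \mathcal{L}_{CFM}(\theta) + C \ \Rightarrow\ \nabla_\theta \mathcal{L}_{FM}(\theta) = \nabla_\theta \mathcal{L}_{CFM}(\theta)$$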
End of Proof.
8. Derive Conditional Vector Field from Conditional Probability Path (Gaussian)
The Conditional Flow Matching objective works with any choice of conditional probability path and conditional vector fields.
Here we consider a family of Gaussian conditional probability paths:
$$p_t(x \mid x_1) = \mathcal{N}\big(x \mid \mu_t(x_1), \sigma_t(x_1)^2 I\big) \tag{14}$$
where $\mu : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ is the time-dependent mean of the Gaussian distribution, while $\sigma : [0,1] \times \mathbb{R}^d \to \mathbb{R}_{>0}$ is a time-dependent scalar standard deviation (std).
At $t=0$, we set $\mu_0(x_1) = 0$ and $\sigma_0(x_1) = 1$, so that all conditional probability paths converge to the same standard Gaussian noise distribution.
At $t=1$, we set $\mu_1(x_1) = x_1$ and $\sigma_1(x_1) = \sigma_{\min}$, so that $p_1(x \mid x_1)$ is a concentrated Gaussian distribution centered at $x_1$.
We choose a simple form of flow (conditioned on $x_1$):
$$\psi_t(x) = \sigma_t(x_1)\,x + \mu_t(x_1) \tag{15}$$
Using Equation (9), we can verify that $\psi_t$ pushes the noise distribution $p_0(x \mid x_1) = p(x)$ to $p_t(x \mid x_1)$.
☝
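Prove that $[\psi_t]_* p_0(x) = p_t(x \mid x_1)$.
Proof:
Since $\psi_t(x) = \sigma_t(x_1)\,x + \mu_t(x_1)$, we can derive its inverse function:
$$\psi_t^{-1}(y) = \frac{y - \mu_t(x_1)}{\sigma_t(x_1)}$$
Its Jacobian is $\frac{1}{\sigma_t(x_1)} I$, with determinant $\sigma_t(x_1)^{-d}$. Applying Equation (9) with $p_0 = \mathcal{N}(0, I)$ gives:
$$[\psi_t]_* p_0(y) = p_0\big(\psi_t^{-1}(y)\big)\,\det\left[\frac{\partial \psi_t^{-1}}{\partial y}(y)\right] = \frac{1}{(2\pi)^{d/2}\,\sigma_t(x_1)^d}\exp\left(-\frac{\|y - \mu_t(x_1)\|^2}{2\sigma_t(x_1)^2}\right) = \mathcal{N}\big(y \mid \mu_t(x_1), \sigma_t(x_1)^2 I\big)$$
which is exactly $p_t(y \mid x_1)$.
Theorem 3. Let $p_t(x \mid x_1)$ be a Gaussian probability path as in Equation (14), and $\psi_t$ its corresponding flow map as in Equation (15). Then, the unique vector field that defines $\psi_t$ has the form:
$$u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\big(x - \mu_t(x_1)\big) + \mu_t'(x_1)$$
where the prime denotes the derivative with respect to time. Consequently, $u_t(x \mid x_1)$ generates the Gaussian path $p_t(x \mid x_1)$.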
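Theorem 3 can be checked numerically: the field $u_t(x \mid x_1) = \frac{\sigma_t'}{\sigma_t}(x - \mu_t) + \mu_t'$ should equal the time derivative of the flow, $\frac{d}{dt}\psi_t(x) = u_t(\psi_t(x) \mid x_1)$. The sketch below uses assumed schedules $\mu_t(x_1) = t^2 x_1$ and $\sigma_t = 1 - 0.9t$, chosen only for illustration, and a central finite difference for $\frac{d}{dt}\psi_t$.

```python
import numpy as np

def mu(t, x1):
    return t ** 2 * x1        # assumed mean schedule

def dmu(t, x1):
    return 2.0 * t * x1       # its time derivative

def sigma(t):
    return 1.0 - 0.9 * t      # assumed std schedule

def dsigma(t):
    return -0.9               # its time derivative

def psi(t, x, x1):
    """Conditional flow psi_t(x) = sigma_t * x + mu_t(x1)."""
    return sigma(t) * x + mu(t, x1)

def u(t, y, x1):
    """Conditional vector field of Theorem 3 evaluated at the point y."""
    return dsigma(t) / sigma(t) * (y - mu(t, x1)) + dmu(t, x1)

x = np.array([0.3, -1.2])
x1 = np.array([2.0, 0.5])
t, h = 0.4, 1e-5
dpsi_dt = (psi(t + h, x, x1) - psi(t - h, x, x1)) / (2 * h)
gap = np.max(np.abs(dpsi_dt - u(t, psi(t, x, x1), x1)))
```

The gap is at the level of floating-point rounding, consistent with $u_t$ being the velocity of the trajectory $t \mapsto \psi_t(x)$.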
9. Instances of Gaussian Conditional Probability Paths
The above formulation is general for arbitrary functions $\mu_t(x_1)$ and $\sigma_t(x_1)$. They can be set to any differentiable functions satisfying the desired boundary conditions.
📖
Recap on the solution to the first-order linear ODE
A first-order linear ODE takes the form:
$$\frac{dy}{dt} + p(t)\,y = g(t)$$
The solution is given by:
$$y(t) = \frac{\int \mu(t)\,g(t)\,dt + c}{\mu(t)}$$
where $\mu(t) = e^{\int p(t)\,dt}$ is the integrating factor and the constant $c$ is determined by the boundary condition.
Example 1. Diffusion Conditional VFs.
Diffusion models start with data points and gradually add noise until the result approximates pure noise. This process can be formulated as a stochastic process, resulting in Gaussian conditional probability paths $p_t(x \mid x_1)$ with specific choices of mean $\mu_t(x_1)$ and std $\sigma_t(x_1)$.
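The Variance Exploding (VE) path has the form:
$$p_t(x \mid x_1) = \mathcal{N}\big(x \mid x_1, \sigma_{1-t}^2 I\big)$$
where $\sigma_t$ is an increasing function, $\sigma_0 = 0$ and $\sigma_1 \gg 1$. In this case, $\mu_t(x_1) = x_1$ and $\sigma_t(x_1) = \sigma_{1-t}$. By Theorem 3 we have:
$$u_t(x \mid x_1) = -\frac{\sigma'_{1-t}}{\sigma_{1-t}}\,(x - x_1)$$
The flow ODE $\frac{d}{dt}\phi_t(x) = u_t(\phi_t(x) \mid x_1)$ is therefore a first-order linear ODE with $p(t) = A(t)$, $g(t) = A(t)\,x_1$, and $A(t) = \frac{\sigma'_{1-t}}{\sigma_{1-t}}$. Since $A$ depends on $t$, the integrating factor uses $\int A(t)\,dt = -\log \sigma_{1-t}$ rather than $At$. So we have:
$$\mu(t) = e^{\int p(t)\,dt} = \frac{c_1}{\sigma_{1-t}}$$
$$\int \mu(t)\,g(t)\,dt = \int \frac{c_1\,\sigma'_{1-t}}{\sigma_{1-t}^2}\,x_1\,dt = \frac{c_1 x_1}{\sigma_{1-t}} + c_2$$
$$\phi_t(x) = \frac{\sigma_{1-t}}{c_1}\Big(\frac{c_1 x_1}{\sigma_{1-t}} + c_2 + c\Big) = x_1 + c\,\sigma_{1-t}$$
where the constants have been absorbed into $c$ in the last step. From the boundary condition $\phi_0(x) = x$ we get $c = (x - x_1)/\sigma_1$. Therefore,
$$\phi_t(x) = x_1 + \frac{\sigma_{1-t}}{\sigma_1}\,(x - x_1)$$
If $A$ were constant in $t$, this would reduce to $\phi_t(x) = x_1 + (x - x_1)\,e^{-At}$.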
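The Variance Preserving (VP) diffusion path has the form:
$$p_t(x \mid x_1) = \mathcal{N}\big(x \mid \alpha_{1-t}\,x_1, (1 - \alpha_{1-t}^2) I\big)$$
where $\alpha_t = e^{-\frac{1}{2}T(t)}$, $T(t) = \int_0^t \beta(s)\,ds$, and $\beta$ is the noise scale function. This gives $\mu_t(x_1) = \alpha_{1-t}\,x_1$ and $\sigma_t(x_1) = \sqrt{1 - \alpha_{1-t}^2}$. Plugging these into Theorem 3 gives the corresponding vector field:
$$u_t(x \mid x_1) = \frac{\alpha'_{1-t}}{1 - \alpha_{1-t}^2}\,\big(\alpha_{1-t}\,x - x_1\big)$$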
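As a numerical sanity check of the VP case, the sketch below compares the general Gaussian form $\frac{\sigma_t'}{\sigma_t}(x - \mu_t) + \mu_t'$, with time derivatives taken by finite differences, against the simplified expression $\frac{\alpha'_{1-t}}{1 - \alpha_{1-t}^2}(\alpha_{1-t}x - x_1)$. A constant noise scale $\beta(s) = \beta$ is assumed purely for illustration, so that $T(t) = \beta t$; the function names are hypothetical.

```python
import numpy as np

BETA = 4.0  # assumed constant noise scale, so T(t) = BETA * t

def alpha(s):
    return np.exp(-0.5 * BETA * s)

def dalpha(s):
    return -0.5 * BETA * alpha(s)  # derivative of alpha with respect to its argument

def mu(t, x1):
    return alpha(1.0 - t) * x1

def sigma(t):
    return np.sqrt(1.0 - alpha(1.0 - t) ** 2)

def u_general(t, x, x1, h=1e-6):
    """sigma_t' / sigma_t * (x - mu_t) + mu_t', with derivatives by central differences."""
    dmu = (mu(t + h, x1) - mu(t - h, x1)) / (2 * h)
    dsig = (sigma(t + h) - sigma(t - h)) / (2 * h)
    return dsig / sigma(t) * (x - mu(t, x1)) + dmu

def u_closed_form(t, x, x1):
    a = alpha(1.0 - t)
    return dalpha(1.0 - t) / (1.0 - a ** 2) * (a * x - x1)

x = np.array([0.3, -1.2])
x1 = np.array([2.0, 0.5])
vp_gap = np.max(np.abs(u_general(0.3, x, x1) - u_closed_form(0.3, x, x1)))
```

The check is run at $t = 0.3$; at $t = 1$ the std $\sigma_t$ vanishes and the field is singular.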
In contrast to the diffusion conditional VFs (VE and VP), this vector field is defined for all $t \in [0,1]$. The conditional flow that corresponds to $u_t(x \mid x_1)$ is
10. Derive Vector Fields from Fokker-Planck Equation
The conditional vector fields derived from Theorem 3 for the VE and VP diffusion paths coincide with the vector fields that govern the Probability Flow ODE (equation 13, in [paper]).
Since the diffusion process runs from data at time $t=0$ to noise at time $t=1$, we need the following lemma to translate the diffusion VFs to our convention, in which $t=0$ corresponds to noise and $t=1$ corresponds to data:
Lemma 1. Consider a flow defined by a vector field $u_t(x)$ generating the probability path $p_t(x)$. Then the vector field $\tilde{u}_t(x) = -u_{1-t}(x)$ generates the path $\tilde{p}_t(x) = p_{1-t}(x)$ when initiated from $\tilde{p}_0(x) = p_1(x)$.
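Lemma 1 is stated for densities, but its pointwise counterpart is easy to check: integrating along $u_t$ from $t=0$ to $t=1$ and then along $\tilde{u}_t(x) = -u_{1-t}(x)$ for another unit of time should bring each point back to where it started. The sketch below does this with forward Euler; the vector field `u` is an arbitrary assumed example and `integrate` is a hypothetical helper.

```python
import numpy as np

def integrate(field, x0, n_steps=2000):
    """Forward-Euler integration of dx/dt = field(t, x) from t=0 to t=1."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * field(k * dt, x)
    return x

def u(t, x):
    return -x + t              # assumed example vector field

def u_reversed(t, x):
    return -u(1.0 - t, x)      # time-reversed field from Lemma 1

x_start = np.array([1.5, -0.5])
x_forward = integrate(u, x_start)             # transport along u
x_back = integrate(u_reversed, x_forward)     # transport back along the reversed field
round_trip_error = np.max(np.abs(x_back - x_start))
```

The round trip is exact for the ODE itself; the small remaining error comes from the Euler discretization and shrinks as `n_steps` grows.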
Conditional VFs for Fokker-Planck probability paths
Consider a Stochastic Differential Equation (SDE) of the standard form:
$$dy = f_t\,dt + g_t\,dw$$
with time parameter $t$, drift $f_t$, diffusion coefficient $g_t$, and Wiener process $w$. The solution $y_t$ to the SDE is a stochastic process, i.e., a continuous time-dependent random variable, whose probability density $p_t(y_t)$ is characterized by the Fokker-Planck equation:
$$\frac{dp_t}{dt} = -\mathrm{div}(f_t p_t) + \frac{g_t^2}{2}\Delta p_t$$
where $\Delta$ is the Laplace operator (in $y$), namely $\mathrm{div}\,\nabla$, with $\nabla$ the gradient operator (also in $y$). This equation can be rewritten in the form of the continuity equation: